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Abstract 



A characteristic feature of samples of curves is the presence of time variability in addition 
to amplitude variability. Existing functional regression methods do not handle time vari- 
ability in an efficient way. In this paper we propose a functional regression method that 
incorporates time warping as an intrinsic part of the model, and then attains high predictive 
power in a parsimonious way, avoiding overfitting. The properties of the estimators are 
studied by simulation, and an application to the study of human lip movement is presented. 

Key Words: Causal Linear Regression; Functional Data Analysis; Registration; Spline 
Smoothing; Time Warping. 



1 Introduction 



Many statistical applications today involve modeling curves as functions of other curves. 
For example, the evolution of CD4 cell counts in HIV patients can be modeled as a function 
of viral load (Liang et al. 2003, Wu and Liang 2004, Wu and Miiller 2010); gene expression 
profiles of insects at the pupa stage can be modeled as functions of gene expression profiles 
at the embryonic stage (Miiller et al. 2008); systolic blood pressure trajectories over the 
years can be predicted to some extent from body mass index trajectories (Yao et al. 2005). 
These are just a few examples that show the broad applicability of functional regression 
methods. 

Functional regression, as applied in the papers cited above, is a more or less straightfor- 
ward extension of multivariate linear regression to the functional-data framework (Ramsay 
and Silverman 2005, ch. 16). Recent developments have focused on theoretical aspects 
such as rates of convergence (Cai and Hall 2006, Hall and Horowitz 2007, Crambes et 
al. 2009), sparse longitudinal data (Yao et al. 2005), and interpretability of the estima- 
tors (James et al. 2009). But a problem inherent to functional data that has received little 
attention in the regression context is the problem of time variability. 

As a motivating example, consider the problem in Malfait and Ramsay (2003). The au- 
thors wanted to predict lip acceleration using electromyography (EMG) curves that mea- 
sure neural activity in the primary muscle depressing the lower lip, the depressor labii 
inferior. A person was asked to repeat the phrase "say Bob again" a few times, and the lip 
movement and associated EMG curve corresponding to the word "Bob" were recorded. 
Lip acceleration curves were obtained by differentiating the smoothed lip trajectories. The 
sample curves, time- standardized to 700 msec, are shown in Figure \T\ Both types of 
curves follow regular patterns, but they show considerable variability both in amplitude 
and timing of the main features. In fact, time variability overwhelms amplitude variabil- 
ity in Figure [It a), to the point that it is not clear whether a typical EMG curve has three 
peaks or four. A pair-by-pair analysis of the curves shows that the EMG spikes are well 
aligned with certain features of the acceleration curves; therefore, the timing of the EMG 
spikes (not just their amplitude) is likely to provide valuable information for predicting lip 
acceleration. 

The classical functional linear regression model does not explicitly model the dynam- 
ics of the curves; instead, it sees time variability as a nuisance. Time variability tends to 
spread the features of both the predictor and the response curves over wide time ranges, 
with the result that the regression function needs to be very irregular to provide a good fit 
to the data; overfitting may become a problem for small sample sizes, like those in Figure 
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Figure 1: Lip Movement Example, (a) EMG curves; (b) lip acceleration curves. 



[T] (which consists of only 29 curves). On the other hand, if the curves are well aligned, 
a smoother and more localized regression function will be sufficient to provide a good 
fit to the data. The situation is similar to the problem of estimating the Karhunen-Loeve 
components: although in theory any squared-integrable sample can be approximated by 
a linear combination of orthogonal functions, in practice it is more efficient to remove as 
much of the time variability as possible first by synchronizing the data, and then to esti- 
mate the (fewer) Karhunen-Loeve components of the aligned curves. For a more general 
discussion of the registration problem see Ramsay and Silverman (2005, ch. 7) and Kneip 
and Ramsay (2008). 

A similar approach could be attempted for the regression problem; that is, to align 
the curves first and then fit a regression model to the synchronized curves. There are 
many warping methods available for this, such as Gervini and Gasser (2004, 2005), James 
(2007), Kneip et al. (2000), Liu and Muller (2004), Ramsay and Li (1998), Tang and 
Miiller (2008, 2009), and Wang and Gasser (1999). But the registration of the covariates 
and the responses would need to be carried out independently, and it would not be possible 
to predict the latter from the former, unless a parallel regression model is also set up for 
the warping functions. This would be cumbersome and statistical inference would be 
complicated. 

Given that time variability cannot be ignored but there are no methods that handle 
time variability efficiently in a regression context, we propose in this paper a regression 
method that incorporates time warping as an intrinsic part of the model. Since we are 
going to apply this method to the above lip movement data, we will focus on the causal 
regression model, or "historical linear model" as Malfait and Ramsay (2003) call it; that 
is, the model does not use the whole trajectory of a covariate to predict a response value 
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at time t, but only those observations preceding time t. In addition, we will assume that 
the sample curves are smooth curves; in effect, we are assuming that the raw data has 
been pre-processed by smoothing. Incorporating warping into the traditional (non-causal) 
functional regression model and modeling sparse and irregular trajectories are important 
problems but they will be addressed in other papers. 

This paper is organized as follows. The dynamic causal regression model is intro- 
duced in Section [21 A convenient semiparametric family of warping functions is discussed 
in Section [3] Asymptotics and inference are discussed in Section HI The finite-sample 
behavior of the estimator is studied by simulation in Section [5] Finally, the lip move- 
ment data mentioned above is analyzed in more detail in Section [6] Matlab® programs 
implementing these methods are available on the author's website. 

2 Dynamic causal regression 
2.1 The model 

Let (xi, . . . , (x n , y n ) be a sample of functions, where Xi is the covariate and yi the 
response. We assume Xi and yi are square-integrable functions on a the same time interval 
[a, b] . A linear predictor of y^t) based on Xi has the form 



where a is the intercept function and (3 the slope function. However, (OQ) employs the 
whole trajectory {xi(s) : s G [a, b}} to predict the value of yi at t, including "future" 
observations Xi(s) with s > t. In many applications this is not reasonable. For example, 
for the lip movement data in Figured] it is clear that future neural activity cannot have an 
influence on past lip movement; therefore, prediction of yi(t) must be based only on the 
partially observed curves {xi(s) : s G [a, t]}. We incorporate this restriction into model 
(Q]) by assuming (3(s, t) = for s > t. This model is called "historical linear model" by 
Malfait and Ramsay (2003), although we prefer to call it "causal linear regression". 

As explained in Section Q3 linear regression works best for synchronized curves. Sup- 
pose, then, that for each pair (xi,yi) we have a warping function W{ : [a, b] — > [a, b], 
strictly monotone increasing and such that Wi(a) = a and Wi(b) = b. Let Xi = Xi o Wi and 
V% — Vi° Wi be the warped curves; then we can apply © to (xi, yi) rather than (a;^ ?/;), and 




(1) 
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the dynamic functional predictor of Ui{t) is defined as 



L{w-\t)- x h a, 0) = a{wr\t)) + f 0{s, ^r 1 (i))^(«; i (s)) ds. (2) 

J a 

Note that the same warping function u>j is used for Xi and this is reasonable for the 
type of applications we have in mind. In particular, using a common warping function 
preserves the causality of the model: if /3(s, t) = for s > t, the integral in © only uses 
those values of Xi(wi(s)) with s < w,~ 1 (t), which is to say only the values of Xi(u) with 
u<t. 

The estimators of a, (3, and the tUjS can be obtained by least squares: if ||/|| := 
{fa Pifydt} 1 ^ 2 is me usual L 2 norm, we minimize 

n 

^2 hi ow i ~L{-\Xi o Wi ,a,(3)\\ 2 (3) 
i=i 

subject to the identifiability condition Y^7=i w i(t)/ n = t- F° r given P and WiS, the a that 
minimizes © is clearly 

a(t) =jj(t) - J P(s,t)&(s) ds, (4) 

so we only need to focus on the estimation of f3 and the WiS. For the warping functions 
we will use the semiparametric family described in Section |3j For (3 we will also use a 
semiparametric family: truncated tensor splines. Specifically, if {ipj(s)} is a (univariate) 
B-spline basis, we define ^^(s,t) = ^(s)^-^) I{s < t}, drop the ^ -s that are zero 
everywhere, and re-scale the remaining p functions so that J2k=i 'Pki 3 ^ t) = 1. Then any 
f3 of the form /3(s, t) = J2k=i bk4>k{ s -> satisfies the causality restriction /3(s, t) = for 
s > t. Other bases are possible, such as the simplicial basis used in Malfait and Ramsay 
(2003). 

Typically the dimension p of this space will be large and then the regularity of j3 must 
be controlled by some form of roughness penalty. In the smoothing literature it is common 
to penalize the curvature of the functions, that is, the L 2 norm of the second derivative, 
which for (3 would translate into J f \\Hj3(s, t) \\ 2 F ds dt, where H is the Hessian and || • \\ F 
the Frobenius norm. However, in the regression context it is more natural to penalize the 
size of (3, as it is done for instance in ridge regression, so we will use J f /3 2 (s, t) ds dt as 
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penalty and minimize 



^ \\yi°Wi - L(-;Xi o w i} a,f3)\\ 2 + A / / (3 2 (s,t) ds dt. (5) 
i=i J 

The penalization parameter A can be chosen via Akaike-like criteria, discussed below. 



2.2 Estimation 

At this point it is convenient to introduce some notation. If / and g are two square- 
integrable functions on [a, b], the usual inner product is defined as (/, g) := f(t)g(t) dt 
(for bivariate functions such as (3(s, t) it is defined analogously.) If now f : [a, b] — » M p 
and g : [a, b] — > W are vector-valued functions with square-integrable elements, we 
denote by (f , g T ) the p x q matrix with elements (f i} gj) (in other words, the integration 
is carried out in a componentwise manner.) With this notation, ((f, g T )) T = (g, f T ) 
and A(f , g T ) = (Af , g T ) for any matrix A of dimensions compatible with f . Then the 
penalized least- squares problem © can be rewritten as 

1 " 

min - V Ik* - b T Zl || 2 + \b T £lb, (6) 

b,{wi} n f—' 11 11 

where y* := y { - y, x* := x { - x, z^t) = t),x*), and Cl = (0, 4> T ), with 0(s, t) = 

Oi(M),...,0 p (s,t)) T . 

The minimization problem © is not quadratic in b due to the dependence of x* and 

y* on the warping functions, but it is conditionally quadratic for given warping functions. 
Therefore we propose an alternating least-squares algorithm: (i) given warping functions 
{■Wi}, find the b that minimizes ©, which is simply the ridge estimator 

/ n \ — n 

b A = ^( Zi ,zf>+nAn ^(z,,^); (V) 

\i=l / i=l 

( n'J given b, for each % — 1 , . . . , n find the w,i that minimizes 1 1 y* — b T z j 1 1 2 over the family 
of warping functions discussed in Section Steps (z) and (ii) are alternated until conver- 
gence. Note that step ( ii) by itself would require many iterations if the exact minimizer Wi 
were to be computed, but to save time we recommend to perform just one step towards the 
solution on each iteration. 

Once the estimators 6c, j3 and wi, . . . ,w n have been obtained, it is possible to use © to 
predict a response function y n +i for a new covariate function x n+ \. The natural predictor 
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of y n+ i given x n+1 is y n+ i(t) = L^l^t); x n+1 , a, 0), but w n+l cannot be obtained by 
minimizing © because that would involve the unobserved response y n+ \. Instead, w n+ i 
can be obtained from x n+ \ alone by registering x n+ \ to the mean of the warped XiS: 

w n+1 = argmin \\x n+1 o w - /y 2 , (8) 

w 

where /i 5 = n^ 1 Y^=i Note that £1% is fixed here, so the "pinching" or "overwarping" 
problem associated with least-squares registration [discussed in Ramsay and Silverman 
(2005, ch. 7.6) and Kneip and Ramsay (2008)] will not be a serious issue. 

2.3 Criteria for penalty parameter selection 

A crucial issue for the good performance of the estimator is selection of the penalty pa- 
rameter A. In the literature this is typically done by cross-validation or by an Akaike-type 
criterion [see e.g. Hastie et al. (2009)]. A variant of Akaike's criterion that we found 
particularly useful is the "corrected Akaike criterion" or AICC, proposed by Hurvitz et 
al. (1998). To the best of the author's knowledge, functional-data equivalents of the AICC 
criterion and of the notion of "effective degrees of freedom" (Hastie et al., 2009) have not 
yet been derived in the literature, so we do it in detail in this section. 

To begin with, we need a functional equivalent of the "hat matrix", an operator : 
{L 2 ([a,b])} n -> {L 2 ([a,b})} n such that y*(t) = (IKy*)(t). Let Z(t) denote the n x p 
matrix with rows zf(t). Then from © we have 

n 

F(t) = Z(t)b A = Z(t)A^( Zl ,y*>, 

i=l 

where A A = Er=i( z i> z f) + xn - Note that > with our notation, EILi( z i>£*) = ( zT >y*) 
and Er=i( z i' z D = ( zT > z ) = : A - Then the " nat operator" "K is given by (!Kf)(t) = 
Z(t)A A ^ 1 (Z T , f), where f is an n-dimensional vector of square-integrable functions. 

The hat operator "K is an integral operator: we can write (!Kf)(t) = J b R(s, t)f(s) ds, 
with R(s, t) = Z(t) A A " 1 Z T (s). The trace of an integral operator is defined as tr{R(s, s)}ds, 
which like its finite-dimensional analogue is equal to the sum of the eigenvalues (Gohberg 
et al., 2003). For our particular operator 'K it is easy to verify that tr(IK) = tr(A^ 1 A), so 
no computations beyond those required for b\ are involved. 

Following Hastie et al. (2009) we define df (A), the "effective degrees of freedom" of 
the model, as tr(IK). And following Hurvitz et al. (1998) we define the "corrected Akaike 
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criterion" as 

AICC(A) = MSE(A) exp[l + 2{df(A) + l}/{n - df (A) - 2}], 

where MSE(A) = n^ 1 EILi Wvt ~ Vi\\ 2 - We can also define a functional equivalent of the 
"generalized cross validation" criterion of Wahba (1990): 

GCV(A) = MSE(A)/{1 - df(A)/n} 2 , 

but this criterion generally leads to undersmoothing, so we will use AICC(A) in this paper. 
The performance of AICC(A) is studied by simulation in Section[5l 



3 A semiparametric family of warping functions 

The warping functions Wi can also be modeled semiparametrically, but due to the mono- 
tonicity restriction some care must be taken in the choice of basis functions. We found that 
a convenient family are the cubic Hermite splines [see e.g. Fritsch and Carlson (1980)]. 
Details are given in the technical supplement, but the basic idea is the following. Let 
r = (r , Ti, . . . , r r+2 ) be a vector of knots in [a, b], with a = r < T\ < ■ ■ ■ < r r+2 = b. 
The knots can be chosen arbitrarily (for example, equispaced point in [a, b}), or they can 
be made to correspond, roughly, to the average location of the salient features of the 
curves. For example, in Figure [Ha) one would choose r = (0, .08, .2, .5, .69) or per- 
haps r = (0, .08, .2, .4, .5, .69). A cubic Hermite spline can be expressed as Wi(t) = 

EfcS c ikVk(t; T ) + E!£o d ik£*(*; r )> with 



h 



Tl-r 

t Tk+l—Tk _ 

U I Tr+l-t 



+ h 



Tk—Tk-l 



if jfe = 

if k = 1,..., 

if k — r + 1, 



for h (t) = (1 + 2t)(l - t) 2 I m (t), and 



h (^) (r fc+1 - r h ) - h (^_L_) (r fc - 

T r + l—t 



T r+l~ T r 



if A; = 

if k = 1, . . . , r 

if A; = r + 1, 



for/i 1 (t) = t(l-t) 2 / [0il] (t). 
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Note that r] k (T k ; r) = 1 and T) k (Tj] r) = for j ^ k, while i k ( T v r) = for all 
j, so Wi(rk) = Cik- Then the coefficients Cj& are interpretable: if the r fe s are average 
"landmarks", the c^s would be the corresponding landmarks of the zth curve, and the 
warping function Wi is implicitly performing a "landmark registration". It is also easy to 
verify that w'^Tk) = dik- It can be shown that for any coefficient vector Cj with a — Cjo < 
en < ■ ■ ■ < Cj )r+ i = 6, a set of derivatives can be constructed so that Wi(t) is strictly 
increasing in [a, b]. Fritsch and Carlson (1980) give necessary and sufficient conditions 
for this, and present a simple one-pass algorithm to construct a set of d ik s that satisfy the 
sufficient conditions. We will use Fritsch-Carlson algorithm in this paper, so the d ik s will 
not be free parameters but functions of the QfcS. Then Wi(t) is entirely parameterized by Cj 
for a given knot sequence r. 

For computational purposes it is convenient to transform the constrained monotone 
vector (cjx, . . . , c ir ) into an unconstrained vector 0j via the (invertible) Jupp transform 
(Jupp, 1978): 6 ik = log{(c iife+ i - c ifc )/(c ifc - c iik -i)}, for k = 1, . . . ,r. The minimiza- 
tion problem in step (ii) of the algorithm presented in Section |2] then becomes a simple 
unconstrained r-dimensional minimization problem, albeit a non-linear one. 

Finally, we note that although we treat r as a fixed sequence of knots in this paper, 
it would not be unreasonable to treat r as a free parameter and estimate it as well. The 
dependence of the basis functions {%} and {£ fe } on r is not complicated, and explicit 
formulas for their derivatives with respect to the knots are easy to obtain, so a Newton- 
Raphson algorithm to find the optimal r could be implemented. However, to keep the 
paper focused, we will not pursue the free-knot alternative here. 

4 Asymptotics 

There are two approaches to the asymptotic analysis of dynamic regression estimators. 
The "nonparametric" approach would investigate the rate of convergence of $ n (s,t) to 
the true regression function [3 (s, t) when the number of basis functions p increases as n 
goes to infinity, using techniques as in e.g. Van de Geer (2000, ch. 10); but this is not 
what we are interested in in this paper. We will follow a "parametric" approach: we will 
hold p fixed and make the smoothing parameter A n go to zero as n goes to infinity [as in 
Knight and Fu (2000)]. Then the basis coefficients h n will converge at the usual n 1 / 2 rate 
to a b* such that /3*(s, t) = b* T 0(s, t), which in general will not be the true P (s, t) but 
hopefully will be close to it. This simplification ignores estimation bias but allows us to 
derive a simple estimator of the asymptotic variance of $ n (s, t). 

Let us first introduce some notation. The warping functions are of the form w(t, 0,^), 



8 



with 6i G W an unconstrained parameter. The minimization © is carried out simultane- 
ously on b and the 0jS, but for the asymptotic analysis it is best to see © as a profile least 
squares problem: let 

r;(0,b)(i) = yi o W (t,0) - /i y (t) - J b T 4>(s,t){xi o w(s, 6) - fx x (s)}ds, 
then for each b define 

0i(b) :=argmin \\n(0, b) || 2 (9) 



and 

/<(>)(*) =r i (0 i (b),b)(*). (10) 



Then 



1 - 

b n = argmin- V \\f t (b)\\ 2 + A n b T ftb. (11) 
b n 

i=i 

For simplicity, we are going to assume fi x and fx are fixed and known. As n — > oo and 
A n — )• 0, b n will converge in probability to 

b*= argmin E {||/,(b)|| 2 }, 

b 

where E denotes expectation with respect to the true model: 

y(t) = fiy(t)+ [ o (s,t){x(s)-fx x {s)}ds + £(t), (12) 

J a 

(x(t),y(t)) = {xow-^yow-^t)), 

for a warping process w(t) and a regression slope f3 (s, t) that may not necessarily belong 
to the families used for estimation. If w and (3 do belong to the families used for estima- 
tion, say f3 (s, t) = <f>(s, t), one expects b* to be close to b , but b* will be equal to 
bo only in the ideal case that e(t) = 0. This is an unfortunate drawback of least-squares 
registration methods: they need exact specification of the model in order to be consistent 
[see e.g. Proposition 1 of Kneip and Ramsay (2008)]. 

Theorem 1 Ifs/riX n -»• £ as n -»• oo, then ^(h n - b*) A A^-^r^fib*, r^Ar -1 ) 
with 

r = E {<f i (b*),f i (b*) T }}, 

A = E {(/ i (b*),f i (b*)}(/ i (b*),f i (b*) T }}, 
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where fj(b)(t) is the gradient of /j(b)(t) with respect to b. 

The proof of Theorem [T]is given in the Appendix. A more explicit expression for fi(b) 
can be obtained as follows. Let D e and D b denote the differentials with respect to the 
variables 6 and b, so fj(b) = D b /j(b) T . Then we have 

D b /i(b)(f) = {D r i (0 i (b) ) b)(t)}{D b 6> i (b)} + D b r i (6> i (b),b)(t), 

where 

T> e r i (6,h)(t) = tf i tw(t,0))D e w(t,0)- [ h T cf>(s,t)x' i (w(s,e))D e w(s,e)ds 

J a 

and 

D b n(0,b)(t) = - [ <f>(s,t) T {xiO W (s,e) - fj. x (s)}ds. 

J a 

From Section |3] we have 

D e w(t,0) = rj(t;T) T D e J- 1 (0) 

with rj(t; t) t = (^(t; r), . . . , r) r (t; r)) and J _1 (0) the inverse Jupp transform. To obtain 
D b ; (b) note that © implies 

(nidiib), b), Dertfiib), b)) = T for all b, 

so D b 0j(b) can be derived via the Implicit Function Theorem: ifFj(0, b) = (rj(0, b), Dor^O, b) T ), 
then 

D b ?; (b) = -{D F i (^(b),b)}- 1 {D b F i (^;(b),b)}, 

where 

D F, ; (0,b) = (D r J (0,b) T ,D r J (0,b)) + ( r< (0, b), H n(6>, b)) 
(Hg denotes the Hessian) and 

D b F. 4 (0,b) = (D e r i (6>,b) T ,D b r i (6>,b)) + (r 4 (6>, b), D b {D e r,(6>, bf}). 

In general, the above derivatives must be computed during the implementation of the 
algorithm proposed in Section |2] so no extra computational complexity is incurred in es- 
timating r and A. The matrices T and A can be consistently estimated by their sample 
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counterparts, 



n 



r 



n 



n 



-1 



i=l 



n 



A. 



•it 



n 



-l 



J2(fi(K)MK)){fi(bn),fI(b n )). 



i=l 



The approximate variance of (3 n (s,t) is then v n (s,t) = 0(s, t) T T n 1 A n f „ 1 </>(s, t)/n, 
which can be used e.g. to test whether 0*(s, t) = for given (s, t). 

5 Simulations 

We ran a number of simulations to assess the comparative performance of the proposed 
method and the common causal functional regression model. The data was generated from 
the following model: 



with slope o (s,t) = 5e~ 50 ^ s - A ^ + ^-- 6 ^I{s < t}, covariates x^s) = ^e" 30 ^"- 4 ) 2 with 
Zi i.i.d. N(l, .l 2 ), and error term ei(t) = Ui sin(47rt) with Ui i.i.d. A^(0, .05 2 ). As warping 
functions we used the one-dimensional exponential family w 0i (t) = [e a%t — l)/(e ai — 1), 
with cijS equally spaced between —1 and 1. This is referred to as "Model 1" in Tabled] Ten 
random pairs (a^, y^) are shown in Figure |2] for illustration. The effect of the regression 
model is, roughly, to shift the peaks from around .4 to around .6. 

The performance of an estimator can be evaluated from two different perspectives: 
the goodness of the estimation of (3 and the goodness of the prediction of new ys. As 
measure of estimation error we use the integrated squared error ISE(/9) = ff {(3(s, t) — 
(3(s,t)} 2 ds dt, and as measure of prediction error we use the mean squared prediction 
error MSPE(/3) = N' 1 J2^ = i hn+i ~ y n +i\\ 2 , where y n+1 , . . . , y n+N are N additional ob- 
servations generated from model (fT3T )- (fi~4l but without error term (we used = 200). It is 
also of interest to study the predicting ability of the estimators for observations that follow 
model (fT3T)-([T4"l) but whose warping functions fall outside the range of the training data. 
To this end, we generated samples with warping parameters equally spaced between — 1 
and 1 for estimation but between 1.5 and 2.5 for prediction. This is referred to as "Model 
2" in Tabled] 




(13) 



(xiOWQ^yiOWQi 1 ), 



(14) 
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Figure 2: Simulated data. Ten sample curves simulated from Model 1. (a) Covariates; (b) 
responses. 
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(.001) 


(.030) 


(.003) 


(.044) 


(.300) 


(.042) 


(.140) 


(200,10) 


.211 


.736 


3.86 


3.99 


.796 


.066 


1.65 


.229 


5.15 


41.4 


5.94 


51.4 




(.001) 


(.0003) 


(.063) 


(.015) 


(.005) 


(.0006) 


(.021) 


(.002) 


(■035) 


(.209) 


(.050) 


(.150) 



Table 1: Simulation Results. Mean estimation errors and mean prediction errors of dy- 
namic and common causal regression estimators (Monte Carlo standard errors in paren- 
thesis). 

As basis functions for the regression slope we used a truncated-product cubic B-spline 
basis with k equally spaced knots in (0, 1). The values k = 5 and k = 10 were con- 
sidered, for which the number of basis functions were p = 66 and p = 141, respec- 
tively. As warping family for the dynamic estimator we used cubic Hermite splines with 
a single knot at T\ = .5. The smoothing parameter A was chosen by AICC for both es- 
timators. Specifically, we minimized AICC(A) over As of the form I0~ u with v in the 
grid {1, 1.25, 1.50, . . . , 5}. The minimum AICC(A) was attained strictly inside the range 
(1CT 5 , 1CT 1 ) in all cases, so there was no spurious bias due to truncation of the range of A. 

The mean ISE(/3) and the mean MSPE(/3) x 10 3 , based on 500 replications of each 
scenario, are reported in Tabled] We also included in Table [T] the smallest estimation and 
prediction errors attainable for the optimal A. This allows us to evaluate the ability of the 
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AICC criterion to select a A close to the optimum. As expected, Table \T\ shows that the 
classical estimator cannot provide accurate prediction of new ys and accurate estimation 
of /3 simultaneously. We found that for all the simulated samples, the classical estimator 
invariably attains the lowest prediction error at the smallest A in the grid (for which (3 is so 
irregular that it does not show any resemblance to the true (3), while the smallest estimation 
error is attained at the largest A in the grid (for which $ is practically zero). In other words, 
the low MSPE values that the classical estimator attains for "Model 1" are entirely due to 
undersmoothing, and are attained by essentially meaningless j3s. In contrast, the dynamic 
regression estimator attains low prediction error and low estimation error simultaneously. 
The predicting ability of the dynamic estimator is most clear for "Model 2", for which the 
MSPE values are an order of magnitude smaller than those of the classical estimator. 

6 Application: Lip Movement Data 

In this section we apply the dynamic causal regression method to the data of Malfait and 
Ramsay (2003) shown in Figure [Q As explained in Section [0 the goal is to predict lip 
acceleration using the EMG curves of lip neural activity. This data is hard to analyze for 
a number of reasons: the curves have sharp and localized features; the first EMG spike 
occurs near 0; there is a lot of time variability; and there are only 29 curves in the sample 
after removing 3 obvious outliers. Our strategy is then to model (3 using a relatively large 
number of well-localized basis functions: we used truncated-product linear B-splines with 
10 equispaced knots on each axis, which makes a total of p = 64 basis functions. As 
warping functions we used cubic Hermite splines with knots r = (0, .05, .2, .4, .5, .69). 
We also tried other knot sequences but the results did not change much; what is most 
important for this data is that some kind of warping be used, regardless of the exact family 
of warping functions. 

In this case the AICC criterion is not very useful for choosing the penalization param- 
eter A, because among As with df (A) < n, AICC(A) always decreases as A increases. The 
same behavior was observed for other choices of spline order and number of knots. The 
likely reason for this behavior, besides the small sample size, is that after synchronization 
there is little variability left to explain. Most of the predictive ability of the explanatory 
curves resides in their dynamics, not in their amplitude. So we chose A = 10~ 3 subjec- 
tively. The resulting warped curves are shown in Figure |3]and the estimated regression 
slope in Figure |4] 

We see that dynamic regression does a good job at aligning these curves. The features 
of both explanatory and response curves emerge more clearly in Figure |3] than in Figure 
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Figure 3: Lip Movement Example, (a) Synchronized EMG curves; (b) synchronized lip 
acceleration curves. 

CD In particular the peaks of the EMG curves around times .4 and .5, that were barely 
discernible in Figure (Ha), are more clear-cut in Figure 0a). These peaks correspond to 
the agonistic and antagonistic actions of the lower lip muscle that mark the beginning and 
the end of the 'b'. 

The estimated regression slope j3(s, t) is shown in Figure Ufa). To facilitate interpre- 
tation we plotted in Figure Hfb) the filtered values /3(s, t)I{\/3(s, t)\ > 2y/i)(s, £)}, where 
v(s,t) is the variance of $(s,t) estimated by bootstrap. Contour plots of these surfaces 
are shown in Figures Sfc) and Hfd). It is not surprising that the sharpest peaks of f3 are 
located on the diagonal near the peaks of the EMG curves, because neural activity has an 
immediate effect on muscle acceleration. But more interestingly, $ shows some significant 
off-diagonal values. The most noticeable is the crest at input time s between .2 and .3 and 
output time t = .45, indicating that the neural activity that takes place while the 'o' in 
'Bob' is being produced has an effect on the production of the subsequent 'b'. This is to 
be expected, since the position reached by the lower lip for the production of the 'o' will 
determine to a large extent the force necessary to close the lip to produce the second 'b'. 
A similar effect was observed by Malfait and Ramsay (2003), but it was less pronounced 
due to the blurring effect of time variability. 

More information about the dynamics of the process can be extracted from the warp- 
ing functions. The use of interpolating Hermite splines facilitates this because the basis 
coefficients Cj satisfy Cj = Wi{r), as explained in Section [3l so they can be interpreted 
as the locations on the curve i of the landmarks corresponding to r. For our choice of 
r, c i2 to c i5 will roughly correspond to the location of the four characteristic peaks of the 
EMG curves, as can be seen in Figure |3£a). Thus, the differences da = c i3 — c i2 and 
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Figure 4: Lip Movement Example, (a) Dynamic regression slope estimator; (b) dynamic 
regression slope estimator filtered by significance; (c) contour plot of slope estimator; (d) 
contour plot of filtered slope estimator. 
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di3 — c i5 ~ c i4 indicate the durations of the two 'b's and d i2 = c i4 — c i3 the duration of 
the 'o'. The pairwise correlations of the differences are p 12 = —.41, p 23 = —.66 and 
p 13 = .26, with bootstrapped standard errors .19, .10 and .21. Then we see that there is a 
significant (and negative) correlation between the duration of two consecutive phonemes, 
but there is no significant correlation between the duration of the two 'b's. Of course, more 
accurate information about the phonemes' duration would be obtained by identifying the 
exact peak locations curve by curve; that would be possible for these 29 curves, but would 
be unfeasible for larger datasets. The advantage of our method is that the parameters c, 
are estimated automatically. 
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Appendix 

Proof of Theorem 3] 

The proof employs standard empirical-process techniques [as in e.g. Van de Geer (2000)] 
and essentially mimics the proof of Theorem 2 in Knight and Fu (2000). Let u n = 

\/n(b n - b*) and 

rt 

Z n (u) = ^ ||/i(b* + u/v^)f + n\ n (b* + u/^fn(b* + u/y/n). 

i=i 

Clearly, it follows from (fTTT) that 

u n = argminZ n (u) - Z n (0), 

since Z n (0) is constant in u. A simple Taylor expansion of (b) at b* implies that Z n (u) — 
Z n (0) = V n {u) + o P (l) with 

v n (u) = u T |igf,(b*)f,(br|u+2|^|j/ J (b*)f J (br|u 

+A n u T fiu + 2y/n\ n u T CLh*. 
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Then, if y/n\ n — > £ as n — > oo, it follows that V n — » V, where V is the stochastic process 
given by 

V(u) = u T Tu + 2w T u + 2£ u T nb*, 

withw ~ N(0, A) (note that Eo{/i(b*)£i(b*)} = because b* minimizes E {||/i(b)|| 2 }.) 
The Argmin Theorem implies that 

u„ A argmin \/(u) = -r _1 (w + ^nb*) 

and this completes the proof. 
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